ABSTRACT

Four procedures were used to estimate a 
criterion-referenced standard for a multiple-choice examination 
developed by the National Board of Medical Examiners (NBME). Two
experimental procedures, the NBME method and a modification of the
Guerin method, and the Angoff and Ebel procedures were evaluated on
the consistency of the estimates they yielded, the plausibility of
the failure rates, and the standard-setters' confidence in their
judgments. The NBME and modified Guerin procedures yielded the most
consistent and least consistent estimates, respectively. The failure
rates associated with the standards obtained using these procedures 
were higher than the failure rate associated with the test's
norm-referenced standard, but only the failure rate associated with 
the modified Guerin procedure was obviously unacceptable. The 
standard-setters said it was difficult to judge the success rate of 
"minimally knowledgeable examinees" with the test questions, but even 
more difficult to make those judgments for the hypothetical 
classifications of items used with the Ebel procedure. The estimates 
obtained using three of the procedures were relatively consistent and
the failure rates associated with them, although higher than the rate
experienced with a norm-referenced standard, were plausible.
(Author/PN)

ABSTRACT

The Angoff and Ebel procedures were used to estimate a criterion-referenced
standard on a written test. Two experimental procedures, the NBME method and a
modification of the Guerin method, also were used to estimate a standard for the
test. The four procedures were compared in terms of the consistency of the
estimates they yielded, the plausibility of the failure rates, and the
standard-setters' confidence in their judgments.

The NBME and modified Guerin procedures yielded the most consistent and least
consistent estimates, respectively. The failure rates associated with the
standards obtained using these procedures were higher than the failure rate
associated with the test's norm-referenced standard, but only the failure rate
associated with the modified Guerin procedure was obviously unacceptable. The
standard-setters said it was difficult to judge the success rate of "minimally
knowledgeable examinees" with the test questions, but even more difficult to
make those judgments for the hypothetical classifications of items used with the
Ebel procedure.

The findings were encouraging in several respects. The estimates obtained using 
three of the procedures were relatively consistent and the failure rates 
associated with them, although higher than the rate experienced with a
norm-referenced standard, were plausible. Only the modified Guerin technique
yielded an inconsistent estimate with an obviously unacceptable failure rate,
and these findings may say more about the modifications made to Guerin's
procedure than
about the unaltered procedure. 



Comparing Four Estimates of the Criterion-Referenced Standard
for a Written Test

by

Francis P. Hughes, Ph.D.

INTRODUCTION

Measurements are used to make decisions about individuals,
and in an educational setting they often are used to determine
whether an individual has achieved desired instructional goals.
If such decisions are to be valid, the standard that is used must
be appropriate and acceptable as an indicator of "minimally
acceptable achievement."

Prior to the date when the National Board replaced its
written essay examinations with lengthy multiple choice tests, an
examination standard was set collectively by the examiners as
they graded the essay responses. The examiners applied their
personal standards for "minimally acceptable achievement," and
the group's standard represented a consensus of their personal
judgments. With the introduction of its multiple choice
examinations, the National Board formally adopted a
norm-referenced standard that resulted in a failure rate similar
to the one experienced when individual examiners applied their
criterion-referenced standards to the grading of essay responses.

Setting a norm-referenced standard at a specified level in
the distribution of test scores for a well-defined group of
examinees requires judgments that are different from those needed
to set a criterion-referenced standard. Norm-referenced
standards require judgments about the definition of an
appropriate reference group and the percentage of examinees in
that group whose achievement is not likely to be "minimally
acceptable." The emphasis is on the reference group and only
indirectly on the knowledge that represents "minimally acceptable
achievement."

Criterion-referenced standards, on the other hand, are based
on judgments about what examinees whose achievement is "minimally 
acceptable" actually know of the content domain in which they are 
being examined. These judgments are expressed with respect to 
the content of items developed to assess that domain. 

Regardless of the judgments on which a standard is based, it
is useful to distinguish between the tasks of estimating the
standard for a content domain and selecting the cutting score for
an examination developed to measure knowledge of that domain. In
this author's opinion, estimates of an examination standard
should be based on one consideration only: the level of
achievement that is "minimally acceptable" or, stated differently,
what "minimally knowledgeable examinees" or "MKEs" actually know
of the content domain. Selecting a cutting score, however,
involves consideration of the estimated standard and its
plausibility as well as the educational and societal impact of
the cutting score and the likelihood of erroneous decisions and
their consequences for the examinees and for society (Millman,
1973).

Estimating a criterion-referenced standard, from a
psychometric perspective, requires a procedure that will
translate the standard-setters' judgments about what "MKEs" know
into test scores. Since many procedures for estimating such a
standard have been described in the psychometric literature, the
choice of procedure is a decision that may influence the estimate
of the standard. This investigation was conducted to provide
information that could help an examining agency choose among four
standard setting procedures involving judgments about test
questions.

The four procedures were evaluated using the estimates they 
yielded of the examination standard. The evaluation focused on 
the consistency of the estimated standards as indicated by their 
standard errors, the plausibility of the failure rates associated
with the estimates, the accuracy of the judgments on which the
estimates were based, and the confidence and ease with which
standard-setters and psychometricians could use the procedures
with on-going examination programs. 



REVIEW OF NBME RESEARCH

The National Board is very much aware of the continuing
discussion in the psychometric literature regarding the issue of
standard setting in general and the merits of normative and
criterion standards in particular. During the past decade it has
supported, either by itself or in cooperation with its client
organizations, numerous research studies comparing the use of
norm- and criterion-referenced standard setting procedures. 

Andrew and Hecht (1976) found that the method described by
Nedelsky (1954) yielded a much lower standard for a nationally
administered certifying examination in the health professions
than the method described by Ebel (1972). Guerin, Burg, and
Vaughan (1978) reported that the standards for two recertifying
examinations obtained using a modified Nedelsky technique were
similar to the norm-referenced standards set for those
examinations. Guerin, Butzin, and Schumacher (1982) investigated
a new procedure and found that it yielded an acceptable
criterion-referenced standard for a recertifying examination that
was not too different from the standard that would have been
obtained had the modified Nedelsky method been used. They also
reported that different groups of standard-setters made similar
judgments, a finding reported by Andrew and Hecht too. Hughes
(1981) described another method that produced increased agreement
among the standard-setters about the choice of a
criterion-referenced standard each time they revised their
previous judgments on the basis of new feedback information.
However, the estimate of the standard tended to fluctuate, rather
than converge, after three iterations with the procedure and it
differed from the normative standard set for tests similar to the
prototype examination used in the study.

The results of these studies are encouraging since they
suggest that appropriate and acceptable criterion-referenced
standards can be set. However, they also suggest that the choice
of method may affect the criterion-referenced standard that is
set. Therefore, setting a criterion-referenced standard not only
requires a decision by the examining agency about who the
standard-setters will be, it also involves the choice of a
psychometric procedure to translate their judgments about
"minimally acceptable achievement" into a test score. The
current study was conducted to obtain information the National
Board could use to guide its choice of a criterion-referenced
standard setting procedure, should the Board decide to alter its
present approach to standard setting.



STANDARD SETTING PROCEDURES 

The methods described by Ebel (1972), Guerin et al. (1982),
and Hughes (1981) were the subject of previous NBME studies and
also were investigated in this study. The Angoff procedure
(1971) was included in this study rather than the Nedelsky
technique (1954) because it does not constrain the
standard-setters' judgments by the number of choices examinees
have when responding to an item.

The Angoff and Ebel methods use only the standard-setters'
judgments about the content of the test items to estimate the
standard. The Guerin procedure and the procedure described by
Hughes, hereafter referred to as the NBME procedure, use the
standard-setters' judgments and psychometric data obtained from a
Rasch item calibration to estimate the standard (Rasch, 1960;
Wright, 1968 and 1977; Wright and Stone, 1979). The calibrated
item difficulties are independent of the examinees whose
responses were used to conduct the calibration and of the sample
of items drawn from the item pool to construct the particular
examination. Therefore, it is not necessary to await the
calibration of the current form of the examination before
commencing the standard-setting activities, since judgments about
pool items that have been previously calibrated to the
examination scale can be used to estimate the standard. Neither
is it necessary to use pool items that are included in the
current form of the examination when estimating the standard,
although it may be desirable to do so.

The standard-setters were selected for their expertise in
one of the six clinical science disciplines comprising the test;
therefore, they only made judgments about the items in their
clinical science subtest. Restricting the standard-setters'
judgments to items in their clinical discipline was facilitated
by the Rasch item calibration which estimated the difficulty of
all items on a common scale. The NBME and Guerin procedures
yielded an estimate of each standard-setter's personal standard
on the total test because all item difficulties were calibrated
to the same scale. The Angoff and Ebel procedures yielded an
estimate of each personal standard on the discipline subtest, and
Rasch procedures were used to equate the subtest score to a score
on the total test.



The Angoff method requires each standard-setter to judge the
MKEs' success rate with every item in his or her clinical science
subtest. The success rate is the judge's estimate of the
proportion of MKEs answering the item correctly. The sum of
these success rates over all items in the subtest is the
standard-setter's estimate of the MKEs' subtest score. The
equivalent score on the total test is an estimate of the
standard-setter's personal standard.
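The Angoff summation can be illustrated with a minimal sketch. The
function name and the five judged success rates below are hypothetical
and are not data from the study; the equating of the subtest score to
the total test scale is not shown.

```python
def angoff_subtest_score(success_rates):
    """Sum the judged MKE success rates (proportions in [0, 1]) over all items."""
    for rate in success_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("success rates must be proportions between 0 and 1")
    return sum(success_rates)


# Hypothetical judgments by one standard-setter for a five-item subtest.
judged = [0.5, 0.75, 0.25, 1.0, 0.5]
expected_raw = angoff_subtest_score(judged)        # expected number of items correct
expected_pct = 100.0 * expected_raw / len(judged)  # same estimate as a percent score
print(expected_raw, expected_pct)
```

With these illustrative judgments the MKEs are expected to answer 3 of
the 5 items correctly, a subtest score of 60 percent.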



The NBME procedure uses the same judgments about the MKEs'
success rate in conjunction with the calibrated difficulty of the
items to estimate the standard-setter's personal standard. This
is done using the Rasch model which postulates that the
probability of a correct response to a test item (P) is a
function of the examinee's knowledge (b) and the item's
difficulty (d). The model hypothesizes that for every test item
(b-d) = log(P/(1-P)). The NBME procedure regresses the
calibrated item difficulty on the logarithmic transformation of
the MKEs' success rate. The regression line intercepts the
difficulty axis at the point on the log-odds axis where (b-d) =
0. Since b = d at this point, the intercept (d) is an estimate
of the MKEs' knowledge (b) as measured on the calibration scale.
The total test score equivalent to this measurement is the
estimate of the standard-setter's personal standard. (See Figure
1 for an illustration of this method.)
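The regression step can be sketched as follows. This is an illustrative
simplification, not the procedure as implemented in the study: it fits
an ordinary least-squares line to individual items, whereas the paper
elsewhere describes the line as passing through the mean difficulty of
items judged to have the same success rate. The function names and the
data are hypothetical, and the conversion of the ability estimate to a
total test score is not shown.

```python
import math


def logit(p):
    """Log-odds of a success rate; undefined for rates of exactly 0 or 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("success rate must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def nbme_ability_estimate(success_rates, difficulties):
    """Regress calibrated difficulty d on log(P/(1-P)); return (intercept, slope).

    Under the Rasch model b - d = log(P/(1-P)), so the intercept (the
    difficulty at log-odds 0) estimates the MKEs' knowledge b.
    """
    if len(success_rates) != len(difficulties):
        raise ValueError("one difficulty is needed for every success rate")
    xs = [logit(p) for p in success_rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_d = sum(difficulties) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("success rates must not all be identical")
    sxd = sum((x - mean_x) * (d - mean_d) for x, d in zip(xs, difficulties))
    slope = sxd / sxx
    intercept = mean_d - slope * mean_x
    return intercept, slope


# Hypothetical judgments that agree exactly with a Rasch ability of b = 0.5.
rates = [0.5, 0.6, 0.7, 0.8]
diffs = [0.5 - logit(p) for p in rates]
ability, slope = nbme_ability_estimate(rates, diffs)
print(round(ability, 6), round(slope, 6))
```

Because the illustrative judgments were constructed to fit the model
exactly, the intercept recovers the assumed ability of 0.5 and the slope
is -1. Real judgments would scatter about the line.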

The Guerin procedure uses the calibrated difficulty of the
test items and the standard-setters' judgments about the
relevance of the items' content to estimate their personal
standards. Each standard-setter rates the items in his or her
clinical science subtest as Essential, Important, Acceptable or
Questionable, but only the items judged to be Essential are used
to estimate the personal standard. The Rasch model estimates
item difficulty and examinee achievement on the same measurement
scale, and the Guerin procedure defines the point on that
calibration scale occupied by the most difficult of the Essential
items as the MKEs' achievement level. The total test score
equivalent to that measurement is the standard-setter's personal
standard. (See Figure 2 for an illustration of this procedure.)
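The rule described above reduces to taking a maximum over the items
rated Essential. The sketch below follows that description only; the
function name, the data layout, and the item values are hypothetical,
and the conversion to a total test score is not shown.

```python
def guerin_ability_estimate(rated_items):
    """Return the calibrated difficulty of the hardest item rated Essential.

    rated_items is a list of (relevance_rating, calibrated_difficulty) pairs.
    """
    essential = [d for rating, d in rated_items if rating == "Essential"]
    if not essential:
        raise ValueError("no items were rated Essential")
    return max(essential)


# Hypothetical ratings and Rasch difficulties (in logits) for four items.
items = [
    ("Essential", -1.2),
    ("Important", 0.8),
    ("Essential", 0.3),
    ("Questionable", 1.5),
]
mke_level = guerin_ability_estimate(items)
print(mke_level)
```

Because the estimate rests on a single item, one hard item rated
Essential is enough to move the personal standard.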

The Ebel method uses judgments about the MKEs' success rate
with hypothetical item-types characterized by relevance and
difficulty to estimate the standard-setters' personal standards.
Four categories of item relevance (Essential, Important,
Acceptable and Questionable) and item difficulty (Easy, On The
Easy Side, On The Hard Side and Hard) are defined. The
standard-setter first judges the MKEs' success rate with each of
the 16 hypothetical item-types, then classifies every item in the
clinical science subtest according to his or her perception of
its relevance and difficulty. The success rate for a category is
used as the MKEs' probability of success with every item
classified in the category, and "minimally acceptable
achievement" on the subtest is estimated by summing the
probabilities over all the items. The equivalent score on the
total test is the estimate of the standard-setter's personal
standard.
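The Ebel computation can be sketched as a table lookup followed by a
sum. All names and numbers below are hypothetical. In particular, the
16 category success rates are generated from a simple rule purely to
fill the table; in the procedure each of the 16 rates is judged
directly by the standard-setter.

```python
RELEVANCE = ("Essential", "Important", "Acceptable", "Questionable")
DIFFICULTY = ("Easy", "On The Easy Side", "On The Hard Side", "Hard")


def ebel_subtest_score(category_rates, classifications):
    """Sum, over all items, the success rate judged for each item's category.

    category_rates maps (relevance, difficulty) to a judged MKE success rate;
    classifications holds one (relevance, difficulty) pair per item.
    """
    return sum(category_rates[category] for category in classifications)


# Hypothetical judged rates for the 16 item-types (illustrative values only).
rel_base = {"Essential": 0.9, "Important": 0.8, "Acceptable": 0.7, "Questionable": 0.6}
diff_drop = {"Easy": 0.0, "On The Easy Side": 0.1, "On The Hard Side": 0.2, "Hard": 0.3}
category_rates = {
    (r, d): rel_base[r] - diff_drop[d] for r in RELEVANCE for d in DIFFICULTY
}

# Hypothetical classification of a four-item subtest.
classified = [
    ("Essential", "Easy"),
    ("Essential", "Hard"),
    ("Important", "On The Easy Side"),
    ("Questionable", "Hard"),
]
subtest_score = ebel_subtest_score(category_rates, classified)
print(round(subtest_score, 2))
```

With these values the four items contribute 0.9, 0.6, 0.7 and 0.3, so
the MKEs are expected to answer about 2.5 of the 4 items correctly.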



COLLECTING THE STANDARD-SETTERS' JUDGMENTS

A recent National Board Part II Examination was used to
estimate a criterion-referenced standard for the content domain
the test was developed to assess. It contained 862 multiple
choice items that were used to obtain a total test score. These
items were distributed in roughly equal numbers to the Internal
Medicine, Surgery, Obstetrics/Gynecology, Preventive
Medicine/Public Health, Pediatrics, and Psychiatry subtests. The
National Board evaluates its candidates in each clinical science
discipline, but it uses a single score for the total test to
determine whether an examinee's achievement is "minimally
acceptable." Therefore, a criterion-referenced standard was
estimated for the total test using each of the psychometric
procedures being investigated.

To do this a panel of standard-setters was formed consisting
of twelve medical educators with previous experience as members
of the National Board's Part II Test Committees. The
standard-setters were chosen for their recognized expertise in one
of the clinical science disciplines assessed by the examination and
for their experience in writing items for, and in constructing, Part
II Examinations. There were two standard-setters for each
clinical science discipline.

The twelve standard-setters met in Philadelphia for a
two-day orientation meeting. Before the meeting they were sent
an overview of the study, information about the various judgments
they would be asked to make, and a small sample of items in their
clinical science discipline. Two groups of examinees were
defined for them: MKEs and TUSMGs whose knowledge is "typical of
graduates of US medical schools". MKEs were defined as

"individuals who have just been awarded the MD
degree by a US medical school and whose level of
medical knowledge is the minimum acceptable for safe
and effective medical practice, under supervision, at
the beginning of residency training."

TUSMGs were described as typical graduates of US medical schools
who have attained

"... a level of medical knowledge beyond the minimum
acceptable for safe and effective medical practice
under supervision."

The standard-setters were instructed to review the sample of
items sent to them and to make the judgments needed to estimate
the MKEs' and the TUSMGs' achievement on the item sample using
each procedure being investigated. They completed this
"instructional exercise" before coming to the meeting and it
served as the basis for a brief training session during the
meeting.

Following the meeting and acting independently of one
another, the standard-setters were asked to review every item in
their clinical science discipline and make the following
judgments in the order indicated: (1) specify the success rate
for MKEs and for TUSMGs with each of the hypothetical item-types
characterized by relevance and difficulty, (2) classify each item
according to their perception of its relevance and difficulty,
(3) specify the success rate for MKEs and for TUSMGs with each of
the items. The appropriate judgments were used with each
procedure to obtain the standard-setter's estimate of the MKEs'
and the TUSMGs' score on the total test.

The average of the standard-setters' personal standards was
the estimate of the group's examination standard. The
consistency of the group's estimate was expressed as a standard
error, calculated by dividing the standard deviation of the
personal standards by the square root of the number of
standard-setters. An estimate of the group's examination
standard was obtained in this manner using the personal standards
estimated with each of the procedures being studied.
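The group estimate and its standard error can be sketched as below.
The personal standards are hypothetical percent scores, and the use of
the sample standard deviation (n - 1 in the denominator) is an
assumption, since the paper does not say which form was used.

```python
import math
import statistics


def group_standard(personal_standards):
    """Return (mean, standard error) of the personal standards.

    Standard error = standard deviation / sqrt(number of standard-setters).
    The sample standard deviation (n - 1) is assumed here.
    """
    n = len(personal_standards)
    if n < 2:
        raise ValueError("at least two personal standards are required")
    mean = statistics.mean(personal_standards)
    sd = statistics.stdev(personal_standards)
    return mean, sd / math.sqrt(n)


# Hypothetical personal standards (percent scores) from four standard-setters.
standards = [60.0, 64.0, 68.0, 72.0]
group_mean, group_se = group_standard(standards)
print(group_mean, round(group_se, 2))
```

For these illustrative values the group standard is 66.0 percent with a
standard error of about 2.58 percent score units.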



PRESENTATION OF THE DATA

Estimates of the standard-setters' personal standards and
the group's examination standard are reported as percent scores
rather than in the standard score metric used by the National
Board. This type of score often is used when
criterion-referenced standards are being considered because by
implication it associates the standard with a mastery level of
the content domain.

Consistency of the Estimated Examination Standards



The NBME procedure yielded the most consistent of the
estimates (see Table 1) with a standard error of 1.8 percent
score units. The Guerin procedure yielded the least consistent
estimate, the standard error being 3.3 score units. Both the
Angoff and the Ebel procedures yielded estimates with a standard
error of 2.5 units.

The consistency of these estimates was improved by computing
the examination standard as the average of the standards
estimated for the clinical science disciplines. (See Table 2.)
The average of the personal standards for judges with expertise
in the same clinical science was used as the standard for that
discipline. When the examination standard was computed in this
manner, the standard error was 1.2 using the NBME procedure, and
2.0, 1.8, and 3.2 respectively using the Ebel, Angoff, and Guerin
procedures. This finding indicates that the variability within
disciplines was greater than the variability between disciplines
and suggests that differences among the standard-setters are
individual differences unrelated to the clinical discipline in
which they are expert.

Plausibility of the Estimated Examination Standards

Plausibility was assessed in two ways. One involved a
comparison of the failure rates that would have occurred had each
of the criterion-referenced standards been used with the failure
rate that did occur when the norm-referenced standard was used.
The other involved an evaluation of the accuracy of the
standard-setters' judgments about success rates with individual
items.

The National Board reference group only contains examinees
who are in their final year at a US medical school, are taking
the Part II Examination for the first time, and are candidates
for NBME certification. Because medical educators have accepted
a failure rate of 2.4% in this group for many years, the
norm-referenced standard has been considered a sensible one. The
standard estimated using the Guerin procedure (see Table 3) would
have resulted in an 85.5% failure rate in the reference group,
which clearly would be unacceptable. The standards estimated
using the NBME, Ebel, and Angoff procedures all had failure rates
higher than 2.4%; however, their respective failure rates (3.5%,
6.0% and 8.3%) did not differ too greatly from the normative
failure rate and might be considered acceptable.

It was not possible to assess the accuracy of the
standard-setters' judgments about the MKEs' success rate with
individual test items because p-values could not be determined
for examinees whose achievement was "minimally acceptable." It
was possible, however, to compare their judgments about the
TUSMGs' success rate with item p-values based on the responses of
the National Board reference group and to compare estimates of
the TUSMGs' average score with the average score for the
reference group. These comparisons provided information (see
Tables 4-5) that was used to evaluate the plausibility of the
standard-setters' judgments.



In general, the difference between the standard-setters'
TUSMG judgments and the reference group p-values did not exceed
10 for about 40% of the items in their clinical discipline.
However, it exceeded 15 for about 45% of those items. (See Table
4.) The average difference for eight standard-setters was
positive and indicated a tendency to overestimate the TUSMGs'
success. For three of them the overestimate was relatively
extreme as shown by an average difference exceeding +10. Only
two standard-setters tended to underestimate the TUSMGs' success,
but neither did by very much. Because almost one-half of the
standard-setters' TUSMG judgments were inaccurate, there is
reason to question the accuracy of their MKE judgments.
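A comparison of this kind can be sketched as follows. The sketch
assumes that both the judged success rates and the reference group
p-values are expressed in percent, that the difference is taken as
judgment minus p-value, and that the thresholds of 10 and 15 apply to
the absolute difference; the paper does not state these details. The
function name and data are hypothetical.

```python
def judgment_accuracy(judged_pct, p_values_pct):
    """Compare judged success rates with observed p-values (both in percent).

    Returns the share of items with |difference| <= 10, the share with
    |difference| > 15, and the mean signed difference (judged - observed).
    """
    if len(judged_pct) != len(p_values_pct) or not judged_pct:
        raise ValueError("need one p-value for every judged success rate")
    diffs = [j - p for j, p in zip(judged_pct, p_values_pct)]
    n = len(diffs)
    within_10 = sum(abs(d) <= 10 for d in diffs) / n
    over_15 = sum(abs(d) > 15 for d in diffs) / n
    mean_diff = sum(diffs) / n
    return within_10, over_15, mean_diff


# Hypothetical TUSMG judgments and reference-group p-values for four items.
judged = [80, 70, 90, 50]
observed = [75, 50, 70, 55]
within_10, over_15, mean_diff = judgment_accuracy(judged, observed)
print(within_10, over_15, mean_diff)
```

A positive mean difference, as in this illustration, corresponds to a
tendency to overestimate the TUSMGs' success.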



Two of the three standard-setters whose tendency to
overestimate TUSMG success was relatively extreme (MED2 and
PMPH2) also had personal standards that were much higher than the
estimate for the group. Both judges who tended to underestimate
the TUSMGs' success (PMPH1 and PEDS2) had personal standards that
were much lower than the group estimate. Because judges whose
TUSMG judgments tended to be inaccurate usually had personal
standards that were extreme, there may be reason to question the
accuracy of their personal standards as estimates of the
examination standard.

The NBME, Ebel, and Angoff procedures were used to estimate
the average score for the group of TUSMGs. (See Table 5.) The
Guerin procedure was not used because it only defined a rule for
determining the achievement level of MKEs. The average score
achieved by the National Board reference group was 65.4%. The
estimate of the TUSMGs' achievement was 60.1% using the NBME
procedure, 65.1% using Ebel's procedure and 69.3% using the
Angoff method. The accuracy of the estimate obtained using the
Ebel procedure suggests that grouping items may help improve the
accuracy with which the examination standard is estimated. The
estimates of the TUSMGs' average score were more consistent than
the estimates of the MKEs' score, possibly because the judges are
more familiar with typical medical students and their level of
achievement.



Feasibility of Implementing the Procedures






Different procedures use different judgments to estimate the
examination standard. Therefore, the opinions of the
standard-setters regarding the comparative ease of making those
judgments and their confidence in them were used to help evaluate
the feasibility of using the standard-setting procedures with
operational examination programs.

In general, the standard-setters said it was easy to
classify test items according to their perceptions of the
relevance and difficulty of the content, as required by the Ebel
procedure, but relatively difficult to judge success rates with
hypothetical item-types characterized by relevance and
difficulty. Most were not sure how changes in relevance or the
interaction between relevance and difficulty should affect their
judgments about the MKEs' success rate. Therefore, they were
uncertain of their judgments and, by inference, lacked confidence
in the estimate of their personal standard based on those
judgments.

The standard-setters also said it was difficult to judge the
MKEs' success rate with actual items as required by the Angoff
and NBME procedures. However, they thought it was easier with
actual items than with hypothetical item-types because they were
tangible and could be examined both for content and format. Only
one said it was easier to judge success rates for hypothetical
item-types and gave as a reason the conceptual standardization
imposed by the relevance and difficulty characterizations.

The standard-setters expressed the desire for concrete
information about the items on which they were basing their
judgments. They wanted reference points to keep them "... in
touch with reality." Although p-values for the items were
available based on the responses of examinees in the National
Board reference group, they were not made known to the
standard-setters for fear that such information might bias their
judgments about the MKEs' success rates.

The standard-setters' opinions suggest that it would not be
feasible to use the Ebel procedure to estimate the standard for
operational examination programs. Psychometricians are likely to
concur in that opinion since the extra effort required of the
standard-setters and psychometricians when using Ebel's procedure
did not yield an estimate that was more consistent or plausible.
The other procedures require only one judgment, not three, and
presented no problems either to the standard-setters or
psychometricians that would detract from the feasibility of using
them with on-going examination programs.



DISCUSSION

The estimate of the examination standard obtained using the
NBME procedure was the most consistent of the four. It was less
sensitive to aberrant judgments about individual items -- low
success rates for easy items or high success rates for hard ones
-- because it fits a regression line through the mean difficulty
of items judged to have the same success rate rather than giving
equal weight to every judgment of an MKE's success rate as is the
case with the Angoff and Ebel procedures. In this way it
diminishes the impact of aberrant judgments on the estimation of
the personal standard. Thus, the error inherent in the
estimation of the examination standard is reduced and, to a
greater degree than with other procedures, that error reflects
actual differences among the personal standards of the
standard-setters rather than inconsistencies associated with
their estimation.

Only the Guerin procedure yielded an estimate of the
examination standard that obviously was not plausible since a
failure rate of 85.5% in the National Board reference group
clearly would not be acceptable. Although the present study used
only single word definitions to distinguish among four degrees of
relevance and Guerin provided his standard-setters with detailed
descriptions of five degrees of relevance, it seems more
reasonable to explain this finding in terms of the psychometric
differences between the Part II Examination and the
recertification examinations Guerin and his colleagues worked
with. A typical recertification examination is likely to contain
a larger percentage of relatively easy items than would be found
in a National Board examination; therefore, the most difficult of
the highly relevant items is more likely to be easier relative to
the calibration scale of the recertification examination than the
Part II Examination. This tendency probably was encouraged by
Guerin's description of highly relevant items which suggested
that they are likely to assess content most examinees know.
Consequently, in Guerin's studies the most relevant items tended
to be the easiest ones, resulting in the examination standards
for the recertification examinations being set at a lower
achievement level than those for the Part II Examination.

The data concerning the feasibility of implementing the
standard setting procedures were a mixture of opinions. The
standard-setters found it easier to judge the relevance of
individual test items than the MKEs' success rate with those
items, and they were less confident about judging the MKEs'
success rate with hypothetical item-types than with actual test
items. However, the Ebel procedure, which requires judgments of
item relevance and difficulty and of success rates with groups of
hypothetical item-types, yielded the most accurate estimate of the
average score for the National Board reference group. This
occurred even though the standard-setters said they were
uncertain about the impact of changes in item relevance and
difficulty on their judgments about success rates and suggests
that it may be more efficacious for standard-setters to judge
success rates for groups of items with common characteristics
than for individual items.

It was not an objective of this study to determine which
criterion-referenced procedures yielded estimates that closely
approximated the norm-referenced standard used by the National
Board, but the similarity between the normative standard and
three of the estimates requires comment. The standard-setters
used in this study were not involved in the process by which the
National Board determined its norm-referenced standard, and the
judgments they expressed in this study were different from the
judgments used to determine that standard. Therefore, the fact
that the normative standard is closely approximated by three
estimates of the criterion-referenced standard suggests that, in
the judgment of this group of standard-setters, it too may be a
reasonable estimate of a criterion-referenced standard
representing "minimally acceptable achievement."



SUMMARY, CONCLUSIONS, AND RECOMMENDATIONS

Four procedures were used to estimate a criterion-referenced
standard for a multiple-choice examination developed by the
National Board of Medical Examiners. The procedures were
evaluated on the basis of their consistency, plausibility, and
feasibility.

The NBME procedure yielded the most consistent, and the
Guerin procedure the least consistent, estimate of the
examination standard. The consistency of all the estimates
increased when the examination standard was computed as the
average of discipline standards rather than personal standards.

Only the Guerin procedure yielded an estimate of the
examination standard that clearly was not plausible, since 85.5%
of the NBME reference group would have failed had it been used.
The outcomes associated with the other estimates could be
considered plausible since they did not differ greatly from the
2.4% failure rate experienced with the normative standard.

There is reason to doubt the accuracy of the
standard-setters' judgments about the MKEs' success rates with
individual items, since their judgments about the reference
group's success rates were inaccurate for roughly 45% of the
items. This raises questions about the validity of the standard
estimated using those judgments, especially since
standard-setters who made the least accurate judgments about the
TUSMGs' success rate also had personal standards that were
extreme. These findings and a desire for information about the
items' performance suggest the need to provide some guidance to
the standard-setters as they judge the MKEs' success rate with
test questions.

The Angoff, Guerin, and NBME procedures were considered
feasible for use in estimating the standard with operational
examination programs because they presented no unusual problems
either to the standard-setters or the psychometricians. The Ebel
procedure was not considered feasible, because the
standard-setters found it difficult and confusing to judge
success rates for hypothetical item-types characterized by
varying degrees of relevance and difficulty. However, there was
some evidence to suggest that making judgments about groups of
items rather than individual items may be advantageous.

Based on these findings it is recommended that the NBME
procedure continue to be investigated. It appeared to be more
promising than the other methods investigated, and the findings
of this study suggest that several modifications may enhance its
attractiveness as a procedure for estimating a
criterion-referenced standard.

One modification would estimate the standard-setters'
personal standards using only those items for which their
judgment of the TUSMGs' success rate was a reasonably accurate
estimate of the reference group's p-value. Another would have
the standard-setters judge the MKEs' success rate with clusters
of highly relevant items of varying difficulty rather than with
individual items, thereby retaining the more desirable features
of the Ebel procedure while eliminating the potential for
confusion arising from the use of items that in the
standard-setter's judgment are not relevant. A third
modification would provide the standard-setters the opportunity
to review their previous judgments in the light of feedback about
the MKEs' success rate implied by the current estimate of their
personal standard and the group's examination standard.
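
The first modification can be sketched in a few lines. This is only an illustration under stated assumptions: the 10-point tolerance is borrowed from the smallest category in Table 4 and is not defined in the text as the meaning of "reasonably accurate," the simple average of MKE judgments is an Angoff-style stand-in rather than the NBME regression, and all function names and data are hypothetical.

```python
def filter_accurate_items(judged_tusmg, reference_p, tolerance=10.0):
    """Return indices of items whose judged TUSMG success rate (percent)
    lies within `tolerance` points of the reference group's p-value."""
    return [i for i, (j, p) in enumerate(zip(judged_tusmg, reference_p))
            if abs(j - p) <= tolerance]


def mean_mke_judgment(judged_mke, keep):
    """Average the MKE success-rate judgments over the retained items."""
    return sum(judged_mke[i] for i in keep) / len(keep)


# Hypothetical judgments for five items (all values in percent).
judged_tusmg = [70.0, 55.0, 90.0, 40.0, 65.0]
reference_p = [72.0, 80.0, 85.0, 38.0, 45.0]
judged_mke = [50.0, 35.0, 70.0, 25.0, 45.0]

# Items 1 and 4 are dropped: the judged TUSMG rate misses p by 25 and 20.
keep = filter_accurate_items(judged_tusmg, reference_p)
restricted_standard = mean_mke_judgment(judged_mke, keep)
```

Restricting the item set this way trades a smaller number of judgments for judgments that are more likely to be well calibrated.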

These modifications could yield more consistent estimates of
the judges' personal standards and of the group's examination
standard. Furthermore, the use of feedback could provide the
standard-setters with a mechanism for refining the estimate of
their personal standards and for approaching consensus about the
examination standard in an atmosphere of reasoned and
deliberative judgment devoid of heated argument and advocacy.
Studies involving these modifications have been planned.
FIGURE 1

Illustration of the NBME Procedure

P = exp(b-d)/[1 + exp(b-d)]

(b-d) = log[P/(1-P)]
Regress d on log[P/(1-P)]
Then, the intercept occurs at (b-d) = 0 (i.e., where b = d)
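
The relations in Figure 1 can be checked numerically. The sketch below is not the National Board's implementation. It assumes that each item has a known calibrated difficulty d and a judged success probability P, and it fits an ordinary least-squares line of d on log[P/(1-P)]. The function names and the synthetic data are hypothetical.

```python
import math


def log_odds(p):
    """Return log[P/(1-P)] for a judged success probability P in (0, 1)."""
    return math.log(p / (1.0 - p))


def regress_d_on_log_odds(difficulties, judged_probs):
    """Ordinary least squares of item difficulty d on log[P/(1-P)].

    Returns (intercept, slope). The intercept is the fitted d at
    log-odds = 0, i.e. where b - d = 0, so it is the estimate of b.
    """
    x = [log_odds(p) for p in judged_probs]
    y = list(difficulties)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic check (hypothetical data): judgments generated exactly from
# the model with b = 0.5 should return intercept 0.5 and slope -1.
true_b = 0.5
item_d = [-1.5, -0.5, 0.0, 0.8, 1.6]
probs = [math.exp(true_b - d) / (1.0 + math.exp(true_b - d)) for d in item_d]
b_hat, slope = regress_d_on_log_odds(item_d, probs)
```

With real judgments the slope will generally differ from -1 and the points will scatter about the line, so the intercept is only an estimate of b.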



[Scattergram (06/01/82): calibration difficulty (down) plotted
against log odds of success (across), with the regression line
drawn. Correlation (R) = -0.28685; R squared = 0.08228;
significance = 0.00054; plotted values = 142.]




FIGURE 2 



An Illustration of the Guerin Procedure 
TABLE 1

ESTIMATES OF PERSONAL STANDARDS AND THE EXAMINATION STANDARD

                          Standard Setting Procedure
Standard-Setters       NBME      Ebel      Angoff    Guerin

A (MED1)               50.6      48.6      50.4      57.7
B (MED2)               61.0      54.9      66.1      84.1

C (SURG1)              52.6      59.3      60.6      78.0
D (SURG2)              51.7      56.6      56.7      83.1

E (OBGYN1)             58.1      66.1      65.3      85.6
F (OBGYN2)             53.2      57.9      53.5      85.5

G (PMPH1)              41.1      36.5      41.5      69.3
H (PMPH2)              65.0      70.1      69.3      56.1

I (PEDS1)              50.8      44.8      51.8      83.4
J (PEDS2)              43.3      47.2      40.4      63.6

K (PSYCH1)             50.1      52.6      52.6      76.1
L (PSYCH2)             50.8      56.0      56.6      55.3

GROUP
  Standard             52.4      54.2      55.4      73.2
  SD                    6.4       8.8       8.7      11.6
  Std. Error            1.8       2.5       2.5       3.3

Standard - 2 SE        48.8      49.2      50.4      66.6
Standard + 2 SE        56.0      59.2      60.4      79.8
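
The group rows of Table 1 can be recomputed from the personal standards. The sketch below assumes a population standard deviation, a standard error of SD divided by the square root of the number of standard-setters, and bounds built from the rounded standard and standard error. These choices are inferred because they appear to reproduce the printed NBME values; they are not documented in the text.

```python
import math
import statistics

# NBME personal standards for standard-setters A through L (Table 1).
nbme = [50.6, 61.0, 52.6, 51.7, 58.1, 53.2,
        41.1, 65.0, 50.8, 43.3, 50.1, 50.8]


def group_summary(personal_standards):
    """Return (standard, sd, se, lower, upper), each rounded to one decimal.

    Assumes a population SD, SE = SD / sqrt(n), and bounds computed from
    the rounded standard and SE (inferred conventions, not documented).
    """
    n = len(personal_standards)
    sd_raw = statistics.pstdev(personal_standards)
    standard = round(statistics.fmean(personal_standards), 1)
    sd = round(sd_raw, 1)
    se = round(sd_raw / math.sqrt(n), 1)
    lower = round(standard - 2 * se, 1)
    upper = round(standard + 2 * se, 1)
    return standard, sd, se, lower, upper


summary = group_summary(nbme)
```

For the NBME column this gives 52.4, 6.4, 1.8, 48.8, and 56.0, matching the table.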
TABLE 2

ESTIMATES OF THE DISCIPLINE STANDARDS AND THE EXAMINATION STANDARD

                          Standard Setting Procedure
Discipline             NBME      Ebel      Angoff    Guerin

Med                    55.8      51.8      58.3      70.9
Surg                   52.2      58.0      58.7      80.6
Ob/Gyn                 55.7      62.0      59.4      85.6
PMPH                   53.1      53.3      55.4      62.7
Peds                   47.1      46.0      46.1      73.5
Psych                  50.5      54.3      54.6      65.7

GROUP
  Standard             52.4      54.2      55.4      73.2
  SD                    3.0       5.0       4.5       7.9
  Std. Error            1.2       2.0       1.8       3.2

Standard - 2 SE        50.0      50.2      51.8      66.8
Standard + 2 SE        54.8      58.2      59.0      79.6
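
Each discipline standard in Table 2 appears to be the average of the two personal standards for that discipline in Table 1. The sketch below reproduces the NBME column under that reading. Rounding halves upward to one decimal is an assumption made so that averages such as 52.15 print as 52.2; the names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# NBME personal standards (Table 1), paired by discipline.
pairs = {
    "Med": ("50.6", "61.0"),
    "Surg": ("52.6", "51.7"),
    "Ob/Gyn": ("58.1", "53.2"),
    "PMPH": ("41.1", "65.0"),
    "Peds": ("50.8", "43.3"),
    "Psych": ("50.1", "50.8"),
}

ONE_DECIMAL = Decimal("0.1")


def discipline_standard(first, second):
    """Average two personal standards, rounding half up to one decimal."""
    mean = (Decimal(first) + Decimal(second)) / 2
    return mean.quantize(ONE_DECIMAL, rounding=ROUND_HALF_UP)


discipline = {name: discipline_standard(a, b) for name, (a, b) in pairs.items()}

# The examination standard is the average of the six discipline standards.
examination_standard = (sum(discipline.values()) / len(discipline)).quantize(
    ONE_DECIMAL, rounding=ROUND_HALF_UP)
```

Averaging within disciplines first leaves the examination standard unchanged but removes the within-discipline disagreement from the spread, which is why the SD and standard error rows are smaller than in Table 1.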
TABLE 3

FAILURE RATES ASSOCIATED WITH ESTIMATES OF THE
EXAMINATION STANDARD

                   Norm-            Criterion-Referenced Estimate
                   Referenced
                   Estimate      NBME       Ebel       Angoff     Guerin

Estimated
Standard           50.5          52.4       54.2       55.4       73.2

Reference Group    2.4%          3.5%       6.0%       8.3%       85.8%
Failure Rate       (n=113)       (n=169)    (n=289)    (n=400)    (n=4113)
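
The failure rate for a given standard is the share of the reference group scoring below it. The sketch below applies the five standards from Table 3 to a small set of hypothetical scores; the scores are invented for illustration, and treating a score equal to the standard as passing is an assumption.

```python
def failure_rate(scores, standard):
    """Return (count, percent) of examinees scoring below the standard."""
    failed = sum(1 for s in scores if s < standard)
    return failed, 100.0 * failed / len(scores)


# Hypothetical percent-correct scores for ten examinees.
scores = [48.0, 51.5, 53.0, 55.0, 58.5, 62.0, 66.0, 70.5, 74.0, 81.0]

# Standards from Table 3 applied to the hypothetical scores.
standards = {
    "Norm-referenced": 50.5,
    "NBME": 52.4,
    "Ebel": 54.2,
    "Angoff": 55.4,
    "Guerin": 73.2,
}
rates = {name: failure_rate(scores, cut) for name, cut in standards.items()}
```

Because the first four standards fall in the lower tail of the score distribution, small changes in the standard move the failure rate only modestly, whereas a standard near 73 fails most of the group.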
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« , TABLE 4 

DIFFERENCE BETWEEN fUSMG JUOGMEMT ANO REFERENCE GROUP P-VALUE 



Magnitude of the Difference ^ 



Less than or 
Equal to 10 



Between 
11 and 15 



Greater than 
15 



Mean Difference* 



Standard- 
















Setters 


n 


% 


n 


% 


n 


% 


TUSMG - P 


A (MED1) 


56 


39% 


25 


18% 


62 


43% 


00 


B (MED2) 


53 


37% 


22 


15* 


69 


48% 


tl 


C ,(SURG1) 


57 


40% 


, 20 


141 


65 


46% 


00 


0 (SURG2) 


57 


40% 


IB 


13* 


v 67 


47% 


13 


E (0BGYN1) 


54 


39% 


21 


15* 


63 


46% 


06 


F (OBGYN2) 


55 


41% 


19 


14* 


61 


45% 


07 


G (PMPH1) 


52 


35% 


27 


18* 


69 


47% 


-06 


H (PMPH2) 


71 


48% 


22 


15* 


55 


37% 


11 


I (PE0S1) 


61 


40% 


19 


14* 


69 


4*% 


03 


J (PEDS2) 


57 


39% 


16 


10* 


76 


51% 


-07 


K (PSYCH1) 


63 


45% 


26 


19* 


51 


36% 


04 


L (PSYCH2) 


60 


43% 


17 


12* 


63 


45% 


02 



*A negative difference indicates that the standard-setter 
TUSMGs* success rate and a positive difference indicates 
setter overestimated the TUSMGs • success rate. 



underestimated the 
that the standard- 
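
The categories in Table 4 can be produced by binning the absolute difference between each judged TUSMG success rate and the reference group's p-value, and the last column by averaging the signed differences. The sketch below assumes whole-number percentages, so no difference falls between 10 and 11; the data and names are hypothetical.

```python
def summarize_differences(judged, reference_p):
    """Bin |judged - p| into the Table 4 categories and return the counts
    together with the mean signed difference (judged minus p)."""
    diffs = [j - p for j, p in zip(judged, reference_p)]
    counts = {"<=10": 0, "11-15": 0, ">15": 0}
    for d in diffs:
        size = abs(d)
        if size <= 10:
            counts["<=10"] += 1
        elif size <= 15:
            counts["11-15"] += 1
        else:
            counts[">15"] += 1
    return counts, sum(diffs) / len(diffs)


# Hypothetical whole-number judgments and p-values, in percent.
judged = [70, 55, 90, 40, 65, 30]
reference_p = [72, 80, 85, 52, 45, 44]
counts, mean_diff = summarize_differences(judged, reference_p)
```

A mean difference near zero can hide large errors in both directions, which is why the table reports the magnitude categories alongside the signed mean.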
TABLE 5

ESTIMATES OF THE AVERAGE SCORE
IN THE NATIONAL BOARD REFERENCE GROUP
(Average = 65.4%; Percentile Rank = 49th)

                             Estimation Procedure
Standard-Setters         NBME        Ebel         Angoff

A (MED1)                 60.6        59.9         65.5
B (MED2)                 65.7        72.3         76.2
C (SURG1)                54.0        70.5         65.5
D (SURG2)                61.0        75.1         78.4
E (OBGYN1)               60.4        68.2         71.7
F (OBGYN2)               60.8        70.1         72.3
G (PMPH1)                57.6        54.2         59.7
H (PMPH2)                71.5        (No Data)    77.4
I (PEDS1)                57.7        57.1         68.2
J (PEDS2)                54.2        58.3         58.9
K (PSYCH1)               62.5        69.4         70.0
L (PSYCH2)               54.9        61.5         67.3

Group Average            60.1        65.1         69.3
(Percentile Rank)        (21st)      (48th)       (70th)
SD                        4.8         6.8          6.1
Std. Error                1.4         2.0          1.8

Average - 2 SE           57.3        61.1         65.7
Average + 2 SE           62.9        69.1         72.9